Modified recurrent equation-based cubic spline interpolation for missing data recovery in phasor measurement unit (PMU)

Background
Smart grid systems require high-quality Phasor Measurement Unit (PMU) data for proper operation, control, and decision-making. Missing PMU data may lead to improper actions or even blackouts. Conventional cubic spline interpolation, which solves a set of linear equations for the spline coefficients, has been applied by many researchers to interpolate missing data, but its computational complexity increases non-linearly with data size.
Methods
In this work, a modified recurrent equation-based cubic spline interpolation procedure for recovering missing PMU data is proposed. The recurrent equation-based method simplifies the computation of the spline constants. Using PMU data from the State Load Despatch Center (SLDC) in Madhya Pradesh, India, the root mean square error (RMSE) and the time of calculation (ToC) are compared for both methods.
Results
The modified recurrent relation method retrieved missing values about 10 times faster than the conventional cubic interpolation method based on the solution of a set of linear equations. The RMSE values show that the proposed method is effective even for special cases of missing values (edges, continuous missing values).
Conclusions
The proposed method can retrieve any number of missing values at any location using observed data with a minimal number of calculations.


Introduction
The worldwide growth of power systems highlights the need for better monitoring and control mechanisms to avoid major blackouts. Smart grids are intelligent systems that facilitate the development of communication, network, and computing technologies, protocols, and standards to integrate power system elements for two-way communication. The time-synchronized high-precision measurement device known as a synchrophasor or Phasor Measurement Unit (PMU) gives clear information on the working of the entire grid. The PMU is used to monitor and control the power grid. It provides real-time measurements that help avert adverse conditions such as blackouts. The combined characteristics of data availability, timeliness, and the communication network contribute to the better performance of the PMU system. Although the role, impact [1], architecture, technology [2], applications, functionality, standards, and evolution of PMU (timing, measurement, communication, and data storage) have been reported since 1995, the North American Synchro Phasor Initiative (NASPI) has highlighted the importance of data quality [3][5][6]. Generally, incomplete or missing data might affect the functionality of the entire system [7]. Hence, a way to handle missing values in PMU data is mandatory for the effective functioning of the entire grid system.
With the advent of PMU systems, large datasets are generated, and finding missing values using traditional cubic interpolation methods takes more computation time as the data size increases. In this paper, a modified recurrent equation-based method, termed the Alpha Method (AM), is proposed for the PMU missing data problem. In this approach, a series of linear equations is solved using the modified recurrent equation to obtain a relationship between points on a spline, which is then used to estimate any missing values on the spline. We compare the proposed method to the more traditional method of solving the linear equations using a tri-diagonal matrix, termed the Linear Equations Method (LEM) in this paper. The proposed AM is computationally more efficient and takes less time to process than the LEM. Moreover, we show that AM is better than LEM in real-time systems, where the dataset grows progressively.

Literature review
The need to recover missing values in PMU data is vital to the proper operation of smart grids and the energy infrastructure. The literature [5][6][7] indicates that missing data in PMU systems can negatively affect the accuracy of the decision-making process and, additionally, introduce security risks to the infrastructure. ][10][11][12] Despite that, this approach is still largely theoretical, and even so, viable methods utilizing this approach have only been tested on simulated data.
Alternatively, interpolation-based missing data recovery techniques [13][14][15] propose the reconstruction of missing values by spatial or spatio-temporal interpolation of the values. Some works [16][17] have even suggested advanced approaches utilizing k-nearest neighbors and recurrent relation-based interpolation. However, interpolation-based techniques need historical data, such as channel or time data, for more accurate calculations. As such, there is a need to design effective data recovery methods that work without historical data processing [3]. A data-driven recovery technique capable of recovering missing entries from the available or observed data is therefore much needed. Moreover, the technique should not become overly complex or require high computational time as the size of the data grows.

Methods
Revised. Amendments from Version 2: We have addressed the reviewer's concerns, namely the consistency in nomenclature and the accuracy of equation numbers. Additionally, we included brief discussions about the results, the characteristics of the dataset, and the choice of evaluation metrics. Any further responses from the reviewers can be found at the end of the article.

Cubic spline interpolation is a widely used polynomial interpolation method for functions of one variable. Let $f$ be a function from $\mathbb{R}$ to $\mathbb{R}$. It is assumed that the value of $f$ is known only at $x_0 \le x_1 \le \dots \le x_i \le \dots \le x_n$, and let $f(x_i) = a_i$. Piecewise cubic spline interpolation is the problem of finding the $b_i$, $c_i$ and $d_i$ coefficients of the cubic polynomials $SF_i$ for $0 \le i \le n-1$ written in the form:

$$SF_i(x) = a_i + b_i (x - x_i) + c_i (x - x_i)^2 + d_i (x - x_i)^3 \qquad (1)$$

where $x$ can take any value between $x_i$ and $x_{i+1}$, that is, $x_i \le x \le x_{i+1}$.

The first-order derivative of equation (1) is:

$$SF_i'(x) = b_i + 2 c_i (x - x_i) + 3 d_i (x - x_i)^2$$

The first-order derivative at $x_i$, for $1 \le i \le n-1$, is:

$$SF_i'(x_i) = b_i$$

The second-order derivative is:

$$SF_i''(x) = 2 c_i + 6 d_i (x - x_i)$$

The second-order derivative at $x_i$, for $1 \le i \le n-1$, is:

$$SF_i''(x_i) = 2 c_i$$

For a smooth fit between the adjacent pieces, the cubic spline interpolation requires that the following conditions hold:
1. The cubic functions should intersect at the points to the left and right, for $i = 0$ to $n-1$: $SF_i(x_i) = a_i$ and $SF_i(x_{i+1}) = a_{i+1}$.
2. For each cubic function to join smoothly with its neighbors, the splines should have continuous first and second derivatives at the data points $i = 1, \dots, n-1$: $SF_{i-1}'(x_i) = SF_i'(x_i)$ and $SF_{i-1}''(x_i) = SF_i''(x_i)$.

If $h_i = x_{i+1} - x_i$ and $h_i$ is equal for all $i$ (written $h$), then, following Revesz [17], the relation between the coefficients $a_i$ and $c_i$ can be resolved as:

$$c_{i-1} + 4 c_i + c_{i+1} = \frac{3}{h^2}\left(a_{i-1} - 2 a_i + a_{i+1}\right), \quad 1 \le i \le n-1 \qquad (7)$$

Equation (7) represents a system of linear equations for the unknowns $c_i$, $0 \le i \le n$. As the values of $a_i$ are known, the values of $c_i$ can be found by solving the tri-diagonal matrix-vector equation $Ax = B$. While there are $n+1$ constants $c_i$, equation (7) yields only $n-1$ equations. Depending on the type of spline assumed, two more equations representing the boundary conditions of the spline are added. In general, two types of splines may be considered: the natural cubic spline and the clamped cubic spline.
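For natural cubic spline interpolation, the following boundary conditions are assumed: $c_0 = c_n = 0$. That is, the second derivatives of the spline at the endpoints are assumed to be zero. Based on equation (7), a system of $(n+1)$ linear equations in $(n+1)$ variables can be formulated as $Ax = B$, with

$$A = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 1 & 4 & 1 & \cdots & 0 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & 1 & 4 & 1 \\ 0 & \cdots & 0 & 0 & 1 \end{bmatrix}, \quad x = \begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_{n-1} \\ c_n \end{bmatrix}, \quad B = \begin{bmatrix} 0 \\ \frac{3}{h^2}(a_0 - 2a_1 + a_2) \\ \vdots \\ \frac{3}{h^2}(a_{n-2} - 2a_{n-1} + a_n) \\ 0 \end{bmatrix}$$

For clamped cubic spline interpolation, the following boundary conditions are assumed: $b_0 = f'(x_0)$ and $b_n = f'(x_n)$, where the derivatives $f'(x_0)$ and $f'(x_n)$ are known constants. Thus, with either set of boundary conditions, both the natural and the clamped cubic spline result in a system of $n+1$ linear equations, which can be solved for a unique solution by any of the standard methods for solving a system of linear equations.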
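The natural-spline case can be sketched as follows, assuming equally spaced knots with spacing `h` and interior rows of the form c[i-1] + 4c[i] + c[i+1] = 3(a[i-1] - 2a[i] + a[i+1])/h^2. The function names `natural_spline_c` and `spline_b_d` and the Thomas-style elimination are illustrative choices, not the authors' implementation, and the b_i and d_i expressions are the textbook forms, assumed here to play the role of equations (8) and (9).

```python
def natural_spline_c(a, h):
    """Solve the tri-diagonal system for c_0..c_n of a natural cubic spline.

    Assumes equally spaced knots (spacing h) and c_0 = c_n = 0.
    """
    n = len(a) - 1
    # Right-hand side: zero in the boundary rows, 3/h^2 * second difference inside.
    rhs = [0.0] * (n + 1)
    for i in range(1, n):
        rhs[i] = 3.0 / h**2 * (a[i - 1] - 2.0 * a[i] + a[i + 1])
    # Row i holds sub[i] * c[i-1] + diag[i] * c[i] + sup[i] * c[i+1] = rhs[i].
    sub = [0.0] + [1.0] * (n - 1) + [0.0]
    diag = [1.0] + [4.0] * (n - 1) + [1.0]
    sup = [0.0] + [1.0] * (n - 1) + [0.0]
    # Forward elimination (Thomas algorithm).
    cp = [0.0] * (n + 1)
    dp = [0.0] * (n + 1)
    cp[0] = sup[0] / diag[0]
    dp[0] = rhs[0] / diag[0]
    for i in range(1, n + 1):
        pivot = diag[i] - sub[i] * cp[i - 1]
        cp[i] = sup[i] / pivot
        dp[i] = (rhs[i] - sub[i] * dp[i - 1]) / pivot
    # Back substitution.
    c = [0.0] * (n + 1)
    c[n] = dp[n]
    for i in range(n - 1, -1, -1):
        c[i] = dp[i] - cp[i] * c[i + 1]
    return c


def spline_b_d(a, c, h):
    """Textbook b_i and d_i for pieces i = 0..n-1, given a_i and c_i."""
    n = len(a) - 1
    b = [(a[i + 1] - a[i]) / h - h * (2.0 * c[i] + c[i + 1]) / 3.0 for i in range(n)]
    d = [(c[i + 1] - c[i]) / (3.0 * h) for i in range(n)]
    return b, d
```

Every call of `natural_spline_c` rebuilds and eliminates the whole system, which is the cost the recurrence-based method is meant to avoid.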
Once the values of $c_i$ are found, the $b_i$ and $d_i$ values can be obtained using equations (8) and (9), respectively. The clamped spline case is handled similarly.

Recurrent equation-based solution

Revesz [17] chose boundary conditions under which the tri-diagonal system given in equation (7) is recast as the matrix equation (12) with coefficient matrix $A$, where the $x_i$ are rational variables, the $e_i$ are rational constants, and $r$ is a non-zero constant. The first row of the new matrix in equation (12) is shown to be equivalent to the first row of the clamped-spline matrix, with $e_1$ defined in equation (13) using an estimate of $c_1$ and $r = 2 + \sqrt{3} \approx 3.732$ [17]. The chosen boundary conditions are such that the first row of the new matrix is the same as that of the clamped cubic spline, while the last row is that of the natural cubic spline, fixing the value of $c_n$ at 0. Using equation (12), the relationships between successive spline points can be obtained and expressed through the constants $\alpha_0$, $\alpha_i$ for $1 \le i \le n-1$, and $\alpha_n$, from which the closed-form solution for $x_i$ is given in equation (16).

Equation (16) solves for $x_i$ regardless of the initial values of $e_i$. This leads to a faster evaluation of the cubic spline than solving a tri-diagonal system. The major advantage of the method arises when new measurements are added to the system. While the conventional tri-diagonal matrix-based algorithm requires the entire computation to be redone, equation (16) leads to a faster update of each $x_i$, $i \le n$, by adding a single term and setting $x_{n+1} = \alpha_{n+1}$. Similarly, the $\alpha_i$ constants can be updated by adding a single term involving $e_{n+1}$.

The system of linear equations given in equation (7) is, in general, solved by the standard solution of linear equations in the matrix form $Ax = b$. Alternatively, it can be solved for $n$ variables by the recurrence relations given in equations (16) and (17). The first method, which uses the tri-diagonal matrix-based solution for the spline coefficients, is termed the Linear Equations Method (LEM), and the second, which uses recurrence relations, is termed the Alpha Method (AM).
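The idea behind the recurrence can be sketched as follows. This is a reading of the description above under stated assumptions, not a reproduction of equations (12) to (17): the first row is taken as r*x[0] + x[1] = e[0], interior rows as x[i-1] + 4*x[i] + x[i+1] = e[i], and the last row as x[n] = e[n]. Because r = 2 + sqrt(3) satisfies r + 1/r = 4, every elimination pivot equals r, so no matrix has to be stored. The name `solve_by_recurrence` is hypothetical.

```python
import math

R = 2.0 + math.sqrt(3.0)  # r + 1/r = 4, so every elimination pivot stays equal to r


def solve_by_recurrence(e):
    """Solve the assumed system with two short sweeps instead of a matrix solve.

    Assumed rows: R*x[0] + x[1] = e[0];
                  x[i-1] + 4*x[i] + x[i+1] = e[i] for 1 <= i <= n-1;
                  x[n] = e[n].
    """
    n = len(e) - 1
    # Forward sweep: alpha[i] = x[i] + x[i+1] / R for i < n.
    alpha = [0.0] * (n + 1)
    alpha[0] = e[0] / R
    for i in range(1, n):
        alpha[i] = (e[i] - alpha[i - 1]) / R
    alpha[n] = e[n]
    # Backward sweep: recover x from the alpha constants.
    x = [0.0] * (n + 1)
    x[n] = alpha[n]
    for i in range(n - 1, -1, -1):
        x[i] = alpha[i] - x[i + 1] / R
    return x
```

When e[0] is zero, the first constant is zero, which matches the initialisation alpha_0 = 0 used in the algorithmic procedure. Each alpha depends only on the previous one, which is why appending a new measurement needs only one more forward step.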
The algorithmic procedures for LEM and AM are given below.
Algorithmic procedure for regular tridiagonal matrix-based Linear Equations Method (LEM)
Step 1: Given the initial vector with missing values, separate it into two vectors, the observed values vector $R_{obs}$ and the missing values vector $R_{Miss}$, of sizes NO and NM, respectively, such that NO + NM = N.
Step 2: The $R_{obs}$ vector at the $x_i$ values of the (NO-1) splines is the $a_i$ coefficient vector.
Step 3: Using $a_i$, generate the RHS vector $B$ given in equation (11).
Step 4: Generate the square coefficient matrix $A$ given in equation (11).
Step 5: Solve for the $c_i$ vector in equation (11) using the relation $Ax = B$.
Step 6: Applying $c_i$ in equations (8) and (9), compute the $b_i$ and $d_i$ coefficient vectors for n-2 points of $R_{obs}$.
Step 7: Using the values of $a_i$, $b_i$, $c_i$ and $d_i$, the missing values can be found by equation (1), re-written as:

$$SF_i(x) = a_i + b_i (x - x_i) + c_i (x - x_i)^2 + d_i (x - x_i)^3 \qquad (18)$$

where $x$ represents the missing positions between $x_i$ and $x_{i+1}$ of spline $i$.
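Steps 1 and 7 of the procedure above can be sketched as follows. The sketch assumes that missing entries are marked with `None` and that the cubic piece has the usual form a + b*t + c*t^2 + d*t^3 with t = x - x_i. The function names `split_observed_missing` and `eval_piece` are hypothetical.

```python
def split_observed_missing(values):
    """Step 1: separate observed entries from missing ones (marked as None)."""
    obs_pos = [k for k, v in enumerate(values) if v is not None]
    r_obs = [values[k] for k in obs_pos]
    miss_pos = [k for k, v in enumerate(values) if v is None]
    return obs_pos, r_obs, miss_pos


def eval_piece(a_i, b_i, c_i, d_i, x_i, x):
    """Step 7: evaluate cubic piece i at a position x with x_i <= x <= x_{i+1}."""
    t = x - x_i
    # Horner form of a_i + b_i*t + c_i*t^2 + d_i*t^3.
    return a_i + t * (b_i + t * (c_i + t * d_i))
```

For a vector of length N, the lengths of `r_obs` and `miss_pos` correspond to NO and NM, so they always add up to N.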

Algorithmic procedure for recurrent equation-based Alpha Method (AM)
Step 1: Given the initial vector with missing values, separate it into two vectors, the observed values vector $R_{obs}$ and the missing values vector $R_{Miss}$, of sizes NO and NM, respectively, such that NO + NM = N.
Step 2: The $R_{obs}$ vector at the $x_i$ values of the (NO-1) splines is the $a_i$ coefficient vector.
Step 3: Using $a_i$, generate the RHS vector $B$ given in equation (11).
Step 4: Set $\alpha_0 = 0$ and $\alpha_n = e_n$, and calculate the alpha vector using the recurrence relation for $\alpha_i$, for $i$ ranging from 1 to NO-1.
Step 5: Set $x_n = \alpha_n$ and solve for the $c_i$ values using the recurrence relation for $x_i$.
Step 6: Applying $c_i$ in equations (8) and (9), compute the $b_i$ and $d_i$ coefficient vectors for n-2 points of $R_{obs}$.
Step 7: Using the values of $a_i$, $b_i$, $c_i$ and $d_i$, the missing values can be found using equation (18), re-written here again for convenience:

$$SF_i(x) = a_i + b_i (x - x_i) + c_i (x - x_i)^2 + d_i (x - x_i)^3$$

where $x$ represents the missing positions between $x_i$ and $x_{i+1}$ of spline $i$.
The modifications are as follows. In AM, rather than computing the E vector, the alpha vector and the $c_i$ coefficients for the full range of NO-1 data points, only the RHS vector E was calculated for the full range of NO-1 data points, while the alpha vector and $c_i$ were calculated only for data elements $i$ and $i+1$, where $i$ is the missing data element. For the imputation of the $i$-th element, only the E vector for all NO-1 data points, the $\alpha$ and $c$ values for $i$ and $i+1$, and the $b_i$ and $d_i$ coefficients were needed.
In addition, using the AM, an effective procedure was demonstrated for the computation of the following cases: (i) a missing first and last element of the data vector, (ii) multiple missing data points at the beginning and the end, and (iii) multiple missing elements anywhere in the data vector. These cases are handled through equation (18).

The formula for RMSE is:

$$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{k=1}^{m} \left(y_k - \hat{y}_k\right)^2}$$

where $y_k$ and $\hat{y}_k$ denote the actual and the recovered values and $m$ is the number of values compared. We used RMSE and ToC as evaluation metrics to measure the effectiveness and efficiency of the proposed method because most of the literature uses the same metrics.
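The two evaluation metrics can be computed as in the following minimal sketch. It assumes the standard definition of RMSE over paired actual and recovered values, and it measures ToC as the wall-clock time of a single call. The helper names `rmse` and `time_of_calculation` are hypothetical and do not come from the paper.

```python
import math
import time


def rmse(actual, recovered):
    """Root mean square error between actual and recovered values."""
    if len(actual) != len(recovered) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - q) ** 2 for p, q in zip(actual, recovered))
    return math.sqrt(squared / len(actual))


def time_of_calculation(fn, *args):
    """Run fn(*args) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

In practice a timing would be repeated and averaged, since a single measurement is sensitive to system load.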

Results and discussion
A comparison between LEM and AM is shown here for the imputation of one minute of real PMU system data, comprising 1490 data points for each of 25 heterogeneous variables obtained from five different PMUs. Since the data do not have any missing values, we artificially introduced 10%, 20%, and 30% missing values at random.
A sample of one-minute PMU data for five PMUs was used in the study [18]. One minute of PMU data with 10%, 20%, and 30% missing data for the five PMUs was evaluated.
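The artificial introduction of missing values can be sketched as follows. The sketch assumes that missing entries are marked with `None` and that positions are drawn uniformly at random without replacement. The function name `mask_random` and the fixed seed are illustrative assumptions, not details from the study.

```python
import random


def mask_random(values, fraction, seed=0):
    """Return a copy of values with about `fraction` of entries set to None."""
    rng = random.Random(seed)  # fixed seed so the masking is repeatable
    n_missing = round(fraction * len(values))
    missing = sorted(rng.sample(range(len(values)), n_missing))
    masked = list(values)
    for k in missing:
        masked[k] = None
    return masked, missing
```

For a variable with 1490 points, fractions of 0.10, 0.20, and 0.30 mask 149, 298, and 447 entries, respectively.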
When AM was employed, the average RMSE values were 0.83, 1.47, and 2.16 for 10%, 20%, and 30% of missing PMU data, respectively. This can be seen in Figure 1. Moreover, for the same performance, AM showed significant improvements in its ToC, as shown in Figure 2. The average ToCs for AM were 1.35, 1.41, and 1.23 s when recovering 10%, 20%, and 30% of the missing data, respectively.
By comparison, LEM had ToC values of 18.83, 16.02, and 16.58 s for 10%, 20%, and 30% of missing data, respectively. The proposed method therefore reduced the ToC by a factor of more than 10. LEM had higher ToC values because it needed to solve the entire set of linear equations every time it needed to find the $b_i$, $c_i$, and $d_i$ coefficients. AM, on the other hand, only needed to calculate these coefficients at the two successive points $i$ and $i+1$.
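As a quick check of the reported speed-up, the ratios of the average ToC values quoted above can be worked out directly. The symbol $S$ for the speed-up ratio is introduced here only for illustration.

```latex
\begin{align*}
S_{10\%} &= \frac{18.83}{1.35} \approx 13.9, \\
S_{20\%} &= \frac{16.02}{1.41} \approx 11.4, \\
S_{30\%} &= \frac{16.58}{1.23} \approx 13.5 .
\end{align*}
```

On these averages, AM takes roughly 7% to 9% of the LEM time.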

Conclusions
In this study, AM was compared with LEM. Because of the proliferation of data, there is a need to customize this technique to handle a high volume of data and to reduce computational time and power. The proposed method demonstrated reduced computational effort and time of calculation for solving the coefficient vectors. This study has made the following contributions: (i) the recurrent relation-based AM has been effectively employed in the imputation of PMU data, and its advantages are demonstrated as an effective and efficient alternative to the conventional technique, and (ii) an effective procedure for handling missing values in special cases (edge, continuous values) is shown, which has not been addressed clearly in other methods. The proposed method proved effective on the tested data, and it requires only about 10% of the effort of the LEM. Future research will focus on the application of the modified recurrent method in the analysis of real-time or streaming PMU data.

Underlying data
Harvard Dataverse: Underlying data for 'Modified recurrent equation-based cubic spline interpolation for missing data recovery in phasor measurement unit (PMU)', 'PMU data', https://doi.org/10.7910/DVN/Y2LLJJ [18].
The dataset presented in this work was obtained as real-world data from a regional electricity authority in India. However, additional information such as the data source, the acquisition procedure, and the significance of the systemic variables is not detailed at this stage of algorithm development, as the goal of this preliminary work is to demonstrate the efficacy of the proposed missing data recovery algorithm.
The paper proposed a method to perform missing data recovery on PMU data. It uses a modified approach based on an established technique, i.e. recurrent equation-based cubic spline interpolation. Although the paper can be difficult to follow in some sections, especially the Methodology, overall the method and its results do show some promise and are worth further investigation. This paper can be considered for publication if the suggestions below are addressed.
The literature review can be improved with a discussion of some current missing data recovery techniques, especially the techniques that the proposed method is based on, that is, the recurrent equation-based method and the tri-diagonal matrix-based conventional cubic spline interpolation. Those techniques are not discussed in the literature review despite their apparent importance to this paper.
Although not required in this paper due to the need for conciseness and perhaps page limits, it is recommended that the authors use some visual aids/figures/plots to help explain the notion of cubic spline in future manuscripts.
On page 5, the section heading called "Recurrence equation-based solution" should be "Recurrent equation-based solution" for consistency with the rest of the text. Similarly, please ensure consistency throughout the paper. There are inconsistent terms such as LEM and LE (in the Introduction), AM and AM Method (using "method" is redundant), etc.
Please check the numbering of the equations and the references/citations to them. For example, on page 5, the statement "The first row of the new matrix in (6) is shown..." would imply that Equation 6 is a matrix, but Equation 6 is just a normal formula.
The dataset is lacking context, but this is understandable due to the confidentiality of the data. It would help if more general characteristics of the dataset were given or an anonymised sample were shown in a table. That would add depth to the discussion.
The discussion provides valuable insights into the comparison of the two methods, but additional details and context in certain areas could enhance the clarity and completeness of the findings.For example, the use of RMSE as an evaluation metric is common in imputation tasks, but it would be helpful to know if other metrics were considered or if there are specific reasons for choosing RMSE.Similarly, why was ToC used as a metric?
The comparison results that are provided in terms of RMSE and ToC seem to indicate that while Alpha Method has higher RMSE values compared to LE method, AM performs better in terms of time.Providing some insights in the discussion section into why this improvement occurred (e.g., the nature of the algorithm, computational complexity) would enhance the discussion.
Can the proposed method be used for other datasets with missing data, or is it optimised specially for PMU data? A brief explanation to clarify the impact of PMU data would further improve the discussion.
Reviewer Expertise: Artificial intelligence, cryptography, information security
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Are sufficient details of methods and analysis provided to allow replication by others?
However, the comments are as follows:
1. The implementation details of the proposed method (hardware or software used) are clearly missing from the manuscript, which hinders reproduction of the results.
2. Most of the manuscript is dedicated to theoretical discussion of the proposed method, but a comparison between the existing methods and the proposed method is missing.
The nomenclature is improved wherever possible to improve the readability of the manuscript.
The dataset presented in the work was obtained from a regional Electricity Authority in India. It was obtained for use as realistic data, and brief details of the PMU data are now included. However, additional information such as the data source, the acquisition process, and the physical significance of the systemic variables is not detailed at this stage of algorithmic development, as the main idea is only to demonstrate the efficacy of the missing data imputation algorithm. Nonetheless, we take note of this suggestion for our next submission. Thank you.
In general, there is a promising aspect to the proposed method, but it has to be conveyed in a clearer manner. Here are my comments.
1. In the Introduction section, the authors state that the comparison will be made with LEM. Can the authors explain why specifically LEM is compared? Is that the current state-of-the-art method?
2. In the Literature Review section, NASPI is mentioned but without a proper definition of the acronym.
3. If my understanding is right, Equation (10) is a system of linear equations of (7). Then, why does the h value in the B matrix have an exponent of 1 instead of 2 as in Equation (7)?
4. Statement above Equation (11): Unless I'm mistaken, there is no d coefficient to be solved from either equation (5) or (6).
5. Equation (11): Like Equation (10), can the authors clarify why the exponent of 1 is used for h?
6. Equation (13): Why is r taking this value? A bit more explanation would be helpful.
7. Equation (16): Why is there a 'for' in the equation?
8. Step 3 of LE method: There is no vector E in Equation (11).
9. Step 3 of AM method: There is no vector E in Equation (11).
10. Step 4 of AM method: There is no alpha term in Equation (11).
11. Results and Discussions section: Can the authors explicitly write down the equation for RMSE? Also, I am quite surprised by the huge difference in RMSE between the two methods, even for the case of 10% missing data, considering that the same equation (18) is used for both algorithms. The difference in ToC is understandable, but the vast difference in RMSE is a bit beyond my expectation. Could the authors briefly comment on the plausible reason for this huge difference in the RMSE value despite both algorithms using equation (18)?
12. Overall comment: The mathematical derivation is not easy to follow and there are potential mistakes in citing the equations, which makes it even harder to follow. Thus, it is difficult to ascertain whether the results can be reproduced.

In equation (18), when the current values of A[i] are replaced with either A[N-1] or A[i-1], depending on the position of the missing edge values or continuous values, the Time of Calculation (ToC) and Root Mean Squared Error (RMSE) values improved significantly.

Is the work clearly and accurately presented and does it cite the current literature? Partly
Is the study design appropriate and is the work technically sound? Yes
Are sufficient details of methods and analysis provided to allow replication by others? Partly
If applicable, is the statistical analysis and its interpretation appropriate? Partly
Are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions drawn adequately supported by the results? Partly
Competing Interests: No competing interests were disclosed.
Is the work clearly and accurately presented and does it cite the current literature? Partly
Is the study design appropriate and is the work technically sound? Partly
Are sufficient details of methods and analysis provided to allow replication by others? No
If applicable, is the statistical analysis and its interpretation appropriate? Partly
Are all the source data underlying the results available to ensure full reproducibility? Partly
Are the conclusions drawn adequately supported by the results? Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Dynamical system modelling
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have